Lesson 3 Problem Set


price vs. x Quiz

In this problem set, you’ll continue to explore the diamonds data set.

Your first task is to create a scatterplot of price vs x. using the ggplot syntax.

# You need to run these commands each time you launch RStudio to access the diamonds data set. RStudio won't remember which packages and data sets you loaded unless you change your preferences or save your workspace. 
setwd("~/Desktop/Nanodegrees/Data Analyst/Task 4/Data Analysis with R/Explore Two Variables/Files/")
library(ggplot2) # must load the ggplot package first 
data(diamonds) # loads the diamonds data set since it comes with the ggplot package 
#summary(diamonds)

ggplot(aes(x, price), data = diamonds) +
  geom_point(alpha = 0.05)


Findings - price vs. x

The plot shows that most diamonds have a length between 4 and 8mm, and that the diamonds price increases as length gets longer, with the highest priced diamonds having the longest length.

There is an exponential relationship between price and x and there are some outliers which are very large and high priced.


Correlations Quiz

# Betwen price and x
with(diamonds, cor.test(price, x, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  price and x
## t = 440.16, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8825835 0.8862594
## sample estimates:
##       cor 
## 0.8844352
# Betwen price and y
with(diamonds, cor.test(price, y, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  price and y
## t = 401.14, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8632867 0.8675241
## sample estimates:
##       cor 
## 0.8654209
# Betwen price and z
with(diamonds, cor.test(price, z, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  price and z
## t = 393.6, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8590541 0.8634131
## sample estimates:
##       cor 
## 0.8612494

What is the correlation between price and x?

0.88

What is the correlation between price and y?

0.87

What is the correlation between price and z?

0.86


price vs. depth Quiz

Create a simple scatter plot of price vs depth.

names(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
ggplot(aes(x = depth, y = price), data = diamonds) +
  geom_point()


Adjustments - price vs. depth Quiz

Change the code to make the transparency of the points to be 1/100 of what they are now and mark the x-axis every 2 units. Hint 1: Use the alpha parameter in geom_point() to adjust the transparency of points.

Hint 2: Use scale_x_continuous() with the breaks parameter to adjust the x-axis.

ggplot(aes(x = depth, y = price), data = diamonds) +
  geom_point(alpha = 0.01) + 
  scale_x_continuous(breaks = seq(0,80,2))


Typical Depth Range Quiz

Based on the scatterplot of depth vs price, most diamonds are between what values of depth?

59 to 64mm


Correlation - price and depth Quiz

# Correlation of depth vs price
with(diamonds, cor.test(depth, price, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  depth and price
## t = -2.473, df = 53938, p-value = 0.0134
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.019084756 -0.002208537
## sample estimates:
##        cor 
## -0.0106474

What’s the correlation of depth vs price?

-0.01

Based on the correlation coefficient, would you use depth to predict the price of a diamond?

I wouldn’t because a score of -0.01 implies no correlation.


price vs. carat Quiz

Create a scatterplot of price vs carat and omit the top 1% of price and carat values.

Both the graphs below are the same, they have different scales though.

ggplot(aes(carat, price), data = diamonds) +
  geom_point(alpha = 0.1) +
  coord_cartesian(xlim = c(0, quantile(diamonds$carat, 0.99)),
                  ylim = c(0, quantile(diamonds$price, 0.99)))

ggplot(aes(x = carat, y = price), 
       data = subset(diamonds, diamonds$price < quantile(diamonds$price, 0.99) &
                       diamonds$carat < quantile(diamonds$carat, 0.99))) + 
  geom_point(alpha = 0.1)


price vs. volume Quiz

Create a scatterplot of price vs. volume (x * y * z). This is a very rough approximation for a diamond’s volume.

Create a new variable for volume in the diamonds data frame. This will be useful in a later exercise.

# Volume variable
diamonds$vol <- (diamonds$x * diamonds$y) * diamonds$z

ggplot(aes(x = vol, y = price),
       data = diamonds) + 
  geom_point(alpha = 0.02) +
  coord_cartesian(xlim = c(40, quantile(diamonds$vol, 0.95)),
                  ylim = c(0, quantile(diamonds$price, 0.95))) +
  geom_smooth(color = 'red')

ggplot(aes(x = vol, y = price),
       data = diamonds) + 
  geom_point() +
  geom_smooth(color = 'red')


Findings - price vs. volume Quiz

I found that there seems to be a correlation between volume and price. As the volume exceeds 200, the line of best fit shows that the price is going up.

Diamonds with a volume of 200+ are rare and the majority of diamonds have a volume of between 45 - 155.

There seems to be some vertical lines that are overplotted, this is probably because there are standard cut volumes and the diamonds have to fall within the range (ie. around 165, 150, 125).

There is an outlier with a volume near 4000 and a cheaper diamond with a volume near 900.


Correlations on Subsets Quiz

with(subset(diamonds, vol > 0 & vol < 800), cor.test(vol, price))
## 
##  Pearson's product-moment correlation
## 
## data:  vol and price
## t = 559.19, df = 53915, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9222944 0.9247772
## sample estimates:
##       cor 
## 0.9235455

What’s the correlation of price and volume? Exclude diamonds that have a volume of 0 or that are greater than or equal to 800.

0.9235455


Adjustments - price vs. volume Quiz

Subset the data to exclude diamonds with a volume greater than or equal to 800. Also, exclude diamonds with a volume of 0. Adjust the transparency of the points and add a linear model to the plot. (Types of smoothers in ggplot2.)

Stat_smooth and Geom_smooth are aliases of each other.

ggplot(aes(x = vol, y = price),
       data = subset(diamonds,
                     diamonds$vol < 800 & diamonds$vol > 0)) + 
  geom_point(alpha = 0.02) +
  geom_smooth(color = 'red') 

Do you think this would be a useful model to estimate the price of diamonds? Why or why not?

I don’t think it’s a useful model because after a volume of 400, the price begins to drop and I find that relationship to be unlikely given the correlation between price and volume > 0 & < 800.


Mean Price by Clarity Quiz

Use the function dplyr package to create a new data frame containing info on diamonds by clarity.

Name the data frame diamondsByClarity.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
diamondsByClarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(mean_price = mean(price),
            median_price = median(price),
            min_price = min(price),
            max_price = max(price),
            n = n())
head(diamondsByClarity)
## # A tibble: 6 × 6
##   clarity mean_price median_price min_price max_price     n
##     <ord>      <dbl>        <dbl>     <int>     <int> <int>
## 1      I1   3924.169         3344       345     18531   741
## 2     SI2   5063.029         4072       326     18804  9194
## 3     SI1   3996.001         2822       326     18818 13065
## 4     VS2   3924.989         2054       334     18823 12258
## 5     VS1   3839.455         2005       327     18795  8171
## 6    VVS2   3283.737         1311       336     18768  5066

Bar Charts of Mean Price Quiz

We’ve created summary data frames with the mean price by clarity and color.

Your task is to write additional code to create two bar plots on one output image using the grid.arrange() function from the package gridExtra.

Bar Charts vs. Histograms

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## Summary data frames with mean price by clarity and color
diamonds_by_clarity <- group_by(diamonds, clarity)
diamonds_mp_by_clarity <- summarise(diamonds_by_clarity, mean_price = mean(price))

diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))

diamonds_by_cut <- group_by(diamonds, cut)
diamonds_mp_by_cut <- summarise(diamonds_by_cut, mean_price = mean(price))

head(diamonds_mp_by_color)
## # A tibble: 6 × 2
##   color mean_price
##   <ord>      <dbl>
## 1     D   3169.954
## 2     E   3076.752
## 3     F   3724.886
## 4     G   3999.136
## 5     H   4486.669
## 6     I   5091.875
head(diamonds_mp_by_clarity)
## # A tibble: 6 × 2
##   clarity mean_price
##     <ord>      <dbl>
## 1      I1   3924.169
## 2     SI2   5063.029
## 3     SI1   3996.001
## 4     VS2   3924.989
## 5     VS1   3839.455
## 6    VVS2   3283.737
## Creating bar plots
p1 <- ggplot(aes(color, mean_price), data=diamonds_mp_by_color) +
  geom_bar(stat='identity')

p2 <- ggplot(aes(clarity, mean_price), data = diamonds_mp_by_clarity) +
  geom_bar(stat='identity')

p3 <- ggplot(aes(cut, mean_price), data = diamonds_mp_by_cut) +
  geom_bar(stat='identity')

grid.arrange(p1, p2, p3)


Gapminder Revisited Quiz

The Gapminder website contains over 500 data sets with information about the world’s population. Your task is to continue the investigation you did at the end of Problem Set 3.

If you’re feeling adventurous or want to try some data munging see if you can find a data set or scrape one from the web.

In your investigation, examine pairs of variable and create 2-5 plots that make use of the techniques from Lesson 4.

library(tidyr)
library(dplyr)
library(ggplot2)
library(gridExtra)
list.files()
## [1] "correlation_images.jpeg"              
## [2] "E_2_Problem_Set_files"                
## [3] "E_2_Problem_Set.html"                 
## [4] "E_2_Problem_Set.rmd"                  
## [5] "indicatorwdigdp_percapita_growth.xlsx"
## [6] "Investment.xlsx"                      
## [7] "lesson4_student.html"                 
## [8] "lesson4_student.rmd"                  
## [9] "pseudo_facebook.tsv"
# Reading excel file into studio
library('rio')
gdp = import('indicatorwdigdp_percapita_growth.xlsx')
invest = import('Investment.xlsx')

# Tidying data. GDI stands for Gross Domestic Investment
gdp = gather(gdp, "year", "GDP", 2:53)
invest = gather(invest, "year", "GDI", 2:53)

# Changing column names in both
colnames(gdp) <- c("country","year","GDP")
colnames(invest) <- c("country","year","GDI")

# Merging datasets by year
data <- merge(gdp, invest, by = c("country","year"))
data <- filter(data, !is.na(data$GDI))
data <- filter(data, !is.na(data$GDP))

# Seeing the correlation
with(data, cor.test(GDP, GDI))
## 
##  Pearson's product-moment correlation
## 
## data:  GDP and GDI
## t = 18.877, df = 6639, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2027515 0.2484058
## sample estimates:
##       cor 
## 0.2257026
# Plotting a scatter plot
ggplot(aes(x = GDI, y = GDP), data = data) +
  geom_point(alpha = 0.2) +
  geom_smooth()

# Understanding how the mean and median of GDP and GDI varies with countries
data.country_by_gdi <- data %>%
  group_by(country) %>%
  summarise(gdi_mean = mean(GDI),
            gdi_median = median(GDI),
            gdp_mean = mean(GDP),
            gdp_median = median(GDP),
            n = n()) %>%
  arrange(country)

p1 <- ggplot(aes(x = gdp_mean, y = gdi_mean),
       data = data.country_by_gdi) +
  geom_line()

p2 <- ggplot(aes(x = gdp_median, y = gdi_median),
       data = data.country_by_gdi) +
  geom_line()

grid.arrange(p1,p2, ncol = 1)

# Plotting a more in depth scatter plot, with a line plot overlayed
ggplot(aes(x = gdi_mean, y = gdp_mean), data = data.country_by_gdi) +
  geom_point(alpha = 0.2) +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1),
            linetype = 2, color = 'red') + 
  geom_smooth()

The variable(s) you investigated and your observations (with any summary statistics).

I investigated the relationship between gdp growth and gdp investment.

The correlation between GDP and GDI is 0.2257026.

The First Scatter plot indicates that most of the GDI is between 0 to 40. Most of the GDP data is between -15 and +15. GDI seemingly has little influence on the GDP as the GDP has a lot of -ve and +ve growth regardless. There are also a few anomalies such as a GDI of 40 with a resulting GDP of 75 or a GDI of 120 and a GDP of 25.

As can be seen in the second graph, the GDP and GDI means and medians vary greatly at each area.

The overlapping scatterplot shows that the gdp and gdi means have a positive correlation.